Twitter Finance Text Analysis + Sentiment Extension

May 2025

Lilian Hu

Project 2 Overview

  • Context: X (formerly Twitter) as a real-time platform for financial discussions
  • Datasets:
    1. Financial Tweets from Kaggle (tweets, timestamps, metadata)
    2. Company lookup table (stocks_cleaned.csv) mapping tickers to company names
  • Key question: Which stocks dominate financial discussions on X, and how are they co-mentioned?

Data Wrangling & Cleaning

  • Lowercase all text (str_to_lower())
  • Remove URLs (str_replace_all("https?://\\S+", ""))
  • Strip punctuation except $ (str_replace_all("[^a-z0-9$\\s]", " ")) so cashtags survive
  • Extract tickers with regex "\\$[A-Za-z]{1,6}", then remove the $ and uppercase
  • Mapping: join with stocks_cleaned for company_name
library(tidyverse)

stocks_cleaned <- read_csv("stocks_cleaned.csv")

tweets <- read_csv("stockerbot_export.csv") |>
  select(text, timestamp, source) |>
  rename(timestamp_original = timestamp) |>
  mutate(
    text_clean = text |>
      str_to_lower() |>
      str_replace_all("https?://\\S+", "") |>
      # Strip punctuation but keep $ so cashtags remain for ticker extraction below
      str_replace_all("[^a-z0-9$\\s]", " ")
  ) |>
  filter(!is.na(text_clean))

stocks_cleaned <- stocks_cleaned |>
  rename(company_name = name) |>
  mutate(ticker = str_to_upper(ticker))

tweets <- tweets |>
  mutate(tickers_found = str_extract_all(text_clean, "\\$[A-Za-z]{1,6}")) |>
  unnest(tickers_found) |>
  mutate(tickers_found = str_remove(tickers_found, "\\$") |> str_to_upper()) |>
  left_join(stocks_cleaned, by = c("tickers_found" = "ticker")) |>
  filter(!is.na(company_name))

ticker_counts <- tweets |>
  count(company_name, sort = TRUE)

Top Mentions
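The top-mentions chart can be sketched from ticker_counts as computed above (the top-10 cutoff and color are arbitrary choices, not from the original analysis):

```r
library(tidyverse)

# ticker_counts has columns company_name and n (mention counts, sorted)
ticker_counts |>
  slice_max(n, n = 10) |>
  ggplot(aes(x = reorder(company_name, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Most-mentioned companies in financial tweets",
       x = "Company", y = "Number of mentions") +
  theme_minimal()
```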

Co-Occurrences
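One way to count co-mentions is widyr::pairwise_count() over (tweet, company) pairs. This is a sketch, not the original implementation; it assumes the widyr package is available and that each row of tweets can be traced back to its source tweet via a tweet_id column:

```r
library(tidyverse)
library(widyr)  # assumed installed; provides pairwise_count()

# tweet_id must identify the ORIGINAL tweet, so add it before unnest()
# in the pipeline above if it is not already present:
#   tweets <- tweets |> mutate(tweet_id = row_number())
co_mentions <- tweets |>
  distinct(tweet_id, company_name) |>
  pairwise_count(company_name, tweet_id, sort = TRUE)

head(co_mentions)
```

Each output row gives a pair of companies and the number of tweets mentioning both.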

Results & Key Findings

  • Dominant mentions: Netflix, Amazon, Alphabet (Google), Facebook, Microsoft, Apple
  • Interpretation: High mention volumes reflect investor focus on earnings, news, and events.
  • Co-occurrence: Pairs like Apple–Microsoft and Amazon–Alphabet indicate sector groupings.

Extra: Sentiment Analysis

Goal: Gauge positive vs. negative tone in financial tweets.

  • Approach:
  1. Tokenize cleaned tweet text.
  2. Join tokens with the Loughran-McDonald finance sentiment lexicon.
  3. Compute net sentiment per ticker (positive − negative word counts).

Sentiment Analysis

  • tidytext for tokenization (unnest_tokens())
  • Loughran-McDonald finance lexicon via textdata
  • inner_join() keeps only sentiment-bearing words
library(tidyverse)
library(tidytext)
library(textdata)

# Give each (tweet, ticker) row a unique id for per-tweet scoring
if (!"tweet_id" %in% names(tweets)) {
  tweets <- tweets |> mutate(tweet_id = row_number())
}

tweet_words <- tweets |> 
  select(tweet_id, company_name, text_clean) |> 
  unnest_tokens(word, text_clean)

lm_lex <- lexicon_loughran() |> 
  filter(sentiment %in% c("positive", "negative"))

tweet_sent_fin <- tweet_words |> 
  inner_join(lm_lex, by = "word") |> 
  mutate(score = if_else(sentiment == "positive", 1, -1))

tweet_scores_fin <- tweet_sent_fin |> 
  group_by(tweet_id, company_name) |> 
  summarise(tweet_score = sum(score), .groups = "drop")

company_fin <- tweet_scores_fin |> 
  group_by(company_name) |> 
  summarise(net_sent = sum(tweet_score),
            n_tweets = n(), .groups = "drop") |> 
  filter(n_tweets >= 50)   # keep companies with enough sentiment-bearing tweets

top20_fin <- company_fin |>
  slice_max(order_by = abs(net_sent), n = 20)

ggplot(top20_fin,
       aes(x = reorder(company_name, net_sent),
           y = net_sent,
           fill = net_sent > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Net sentiment in tweets (Loughran-McDonald lexicon)",
       x = "Company",
       y = "Positive – Negative word count") +
  scale_fill_manual(values = c("TRUE" = "steelblue",
                               "FALSE" = "firebrick")) +
  theme_minimal()